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Abstract.  Intuitively,  if  a  density  operator  has  small  rank,  then  it  should 
be  easier  to  estimate  from  experimental  data,  since  in  this  case  only  a  few 
eigenvectors  need  to  be  learned.  We  prove  two  complementary  results  that 
confirm  this  intuition.  Firstly,  we  show  that  a  low-rank  density  matrix  can 
be  estimated  using  fewer  copies  of  the  state,  i.e.  the  sample  complexity  of 
tomography  decreases  with  the  rank.  Secondly,  we  show  that  unknown  low- 
rank  states  can  be  reconstructed  from  an  incomplete  set  of  measurements,  using 
techniques  from  compressed  sensing  and  matrix  completion.  These  techniques 
use  simple  Pauli  measurements,  and  their  output  can  be  certified  without  making 
any  assumptions  about  the  unknown  state.  In  this  paper,  we  present  a  new 
theoretical  analysis  of  compressed  tomography,  based  on  the  restricted  isometry 
property  for  low-rank  matrices.  Using  these  tools,  we  obtain  near-optimal  error 
bounds  for  the  realistic  situation  where  the  data  contain  noise  due  to  finite 
statistics,  and  the  density  matrix  is  full-rank  with  decaying  eigenvalues.  We  also 
obtain  upper  bounds  on  the  sample  complexity  of  compressed  tomography,  and 
almost-matching  lower  bounds  on  the  sample  complexity  of  any  procedure  using 
adaptive  sequences  of  Pauli  measurements.  Using  numerical  simulations,  we 


^  Author  to  whom  any  correspondence  should  be  addressed. 


@®®@ 


Content  from  this  work  may  be  used  under  the  terms  of  the  Creative  Commons  Attribution-NonCommerciai- 
ShareAlike  3.0  iicence.  Any  further  distribution  of  this  work  must  maintain  attribution  to  the  author(s)  and  the  titie 


of  the  work,  journai  citation  and  DOi. 


New  Journal  of  Physics  14  (2012)  095022 
1 367-2630/1 2/095022+28$33.00 


©  iOP  Pubiishing  Ltd  and  Deutsche  Physikaiische  Geseiischaft 


2 


lOP  Institute  of  Physics  <J)  Deutsche  physikalische  gesellschaft 


compare  the  performance  of  two  compressed  sensing  estimators — the  matrix 
Dantzig  selector  and  the  matrix  Lasso — with  standard  maximum-likelihood 
estimation  (MLE).  We  find  that,  given  comparable  experimental  resources, 
the  compressed  sensing  estimators  consistently  produce  higher  fidelity  state 
reconstructions  than  MLE.  In  addition,  the  use  of  an  incomplete  set  of 
measurements  leads  to  faster  classical  processing  with  no  loss  of  accuracy. 
Einally,  we  show  how  to  certify  the  accuracy  of  a  low-rank  estimate  using  direct 
fidelity  estimation,  and  describe  a  method  for  compressed  quantum  process 
tomography  that  works  for  processes  with  small  Kraus  rank  and  requires  only 
Pauli  eigenstate  preparations  and  Pauli  measurements. 
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1.  Introduction 

In  recent  years,  there  has  been  amazing  progress  in  studying  complex  quantum  mechanical 
systems  under  controlled  laboratory  conditions  [1].  Quantum  tomography  of  states  and 
processes  is  an  invaluable  tool  used  in  many  such  experiments,  because  it  enables  a  complete 
characterization  of  the  state  of  a  quantum  system  or  process  (see,  e.g.,  [2-16]).  Unfortunately, 
tomography  is  very  resource  intensive,  and  scales  exponentially  with  the  size  of  the  system.  Eor 
example,  a  system  of  n  spin- 1/2  particles  (qubits)  has  a  Hilbert  space  with  dimension  d  =  2", 
and  the  state  of  the  system  is  described  by  a  density  matrix  with  =  4"  entries. 
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Recently,  a  new  approach  to  tomography  was  proposed:  compressed  quantum  tomography, 
based  on  techniques  from  compressed  sensing  [17,  18].  The  basic  idea  is  to  concentrate  on  states 
that  are  well  approximated  by  density  matrices  of  rank  r  <^d.  This  approach  can  be  applied  to 
many  realistic  experimental  situations,  where  the  ideal  state  of  the  system  is  pure,  and  physical 
constraints  (e.g.  low  temperature  or  the  locality  of  interactions)  ensure  that  the  actual  (noisy) 
state  still  has  low  entropy. 

This  approach  is  convenient  because  it  does  not  require  detailed  knowledge  about  the 
system.  However,  note  that  when  such  knowledge  is  available,  one  can  use  alternative 
formulations  of  compressed  tomography,  with  different  notions  of  sparsity,  to  further  reduce 
the  dimensionality  of  the  problem  [19].  We  will  compare  these  methods  in  section  6.2. 

The  main  challenge  in  compressed  tomography  is  how  to  exploit  this  low-rank  structure, 
when  one  does  not  know  the  subspace  on  which  the  state  is  supported.  Consider  the  example  of 
a  pure  quantum  state.  Since  pure  states  are  specified  by  only  0{d)  numbers,  it  seems  plausible 
that  one  could  be  reconstructed  after  measuring  only  0{d)  observables,  compared  with  0(d^) 
for  a  general  mixed  state.  While  this  intuition  is  indeed  correct  [20-23],  it  is  a  challenge  to 
devise  a  practical  tomography  scheme  that  takes  advantage  of  this.  In  particular,  one  is  restricted 
to  those  measurements  that  can  be  easily  performed  in  the  laboratory;  furthermore,  one  then  has 
to  find  a  pure  state  consistent  with  measured  data  [24],  preferably  by  some  procedure  that  is 
computationally  efficient  (note  that,  in  general,  finding  minimum-rank  solutions  is  NP-hard, 
hence  computationally  intactable  [25]). 

Compressed  tomography  provides  a  solution  that  meets  all  these  practical  require¬ 
ments  [17,  18].  It  requires  measurements  of  two-outcome  Pauli  observables,  which  are  feasi¬ 
ble  in  many  experimental  systems.  In  total,  it  uses  a  random  subset  of  m  =  O  (rd  log  d)  Pauli 
observables,  which  is  just  slightly  more  than  the  Oird)  degrees  of  freedom  that  specify  an  ar¬ 
bitrary  rank  r  state.  Then  the  density  matrix  p  is  reconstructed  by  solving  a  convex  program. 
This  can  be  done  efficiently  using  general-purpose  semidefinite  program  (SDP)  solvers  [26]  or 
specialized  algorithms  for  larger  instances  [27-29].  The  scheme  is  robust  to  noise  and  contin¬ 
ues  to  perform  well  when  the  measurements  are  imprecise  or  when  the  state  is  only  close  to  a 
low-rank  state. 

Here,  we  follow  up  on  [17,  18]  by  giving  a  stronger  (and  completely  different)  theoretical 
analysis,  which  is  based  on  the  restricted  isometry  property  (RIP)  [30-32].  This  answers  a 
number  of  questions  that  could  not  be  addressed  satisfactorily  using  earlier  techniques  based  on 
dual  certificates.  Firstly,  how  large  is  the  error  in  our  estimated  density  matrix  when  the  true  state 
is  full-rank  with  decaying  eigenvalues?  We  show  that  the  error  is  not  much  larger  than  the  ‘tail’ 
of  the  eigenvalue  spectrum  of  the  true  state.  Secondly,  how  large  is  the  sample  complexity  of 
compressed  tomography,  i.e.  how  many  copies  of  the  unknown  state  are  needed  to  estimate 
its  density  matrix?  We  show  that  compressed  tomography  achieves  nearly  optimal  sample 
complexity  among  all  procedures  using  Pauli  measurements  and,  surprisingly,  the  sample 
complexity  of  compressed  tomography  is  nearly  independent  of  the  number  of  measurement 
settings  m  as  long  asm  ^  Q  {rd  poly  log  d). 

In  addition,  we  use  numerical  simulations  to  investigate  the  question:  given  a  fixed  time 
T  during  which  an  experiment  can  be  run,  is  it  better  to  do  compressed  tomography  or  full 
tomography,  i.e.  is  it  better  to  use  a  few  measurement  settings  and  repeat  them  many  times  or  do 
all  possible  measurements  with  fewer  repetitions?  For  the  situations  we  simulate,  we  find  that 
compressed  tomography  provides  better  accuracy  at  a  reduced  computational  cost  compared  to 
standard  maximum-likelihood  estimation  (MLE). 
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Finally,  we  provide  two  useful  tools:  a  procedure  for  certifying  the  accuracy  of  low- 
rank  state  estimates  and  a  very  simple  compressed  sensing  technique  for  quantum  process 
tomography. 

We  now  describe  these  results  in  more  detail. 

Theoretical  analysis  using  RIP.  We  use  a  fundamental  geometric  fact:  the  manifold  of  rank-r 
matrices  in  can  be  embedded  into  O  (rd  poly  log  d)  dimensions,  with  low  distortion  in  the 
two-norm.  An  embedding  that  does  this  is  said  to  satisfy  the  RIP  [32].  In  [30],  it  was  shown 
that  such  an  embedding  can  be  realized  using  the  expectation  values  of  a  random  subset  of 
O  (rd  poly  log  d)  Pauli  matrices.  This  implies  the  existence  of  the  so-called  ‘universal’  methods 
for  low-rank  matrix  recovery:  there  exists  Si  fixed  set  of  0 (rd  poly  log  J)  Pauli  measurements, 
which  has  the  ability  to  reconstruct  every  rank-r  d  x  d  matrix.  Moreover,  with  high  probability, 
a  random  choice  of  Pauli  measurements  will  achieve  this.  (The  earlier  results  of  [17]  placed  the 
quantifiers  in  the  opposite  order:  for  every  rank-r  d  x  d  matrix  p,  most  sets  of  0(rd  poly  log  d) 
Pauli  measurements  can  reconstruct  that  particular  matrix  p.) 

Intuitively,  the  RIP  says  that  a  set  of  random  Pauli  measurements  is  sensitive  to  all  low-rank 
errors  simultaneously.  This  is  important,  because  it  implies  stronger  error  bounds  for  low-rank 
matrix  recovery  [31].  These  bounds  show  that,  when  the  unknown  matrix  p  is,  full-rank,  our 
method  returns  a  (certifiable)  rank-r  approximation  of  p  that  is  almost  as  good  as  the  best  such 
approximation  (corresponding  to  the  truncated  eigenvalue  decomposition  of  p). 

In  [31],  these  error  bounds  were  used  to  show  the  accuracy  of  certain  compressed  sensing 
estimators,  for  measurements  with  additive  Gaussian  noise.  Here,  we  use  them  to  upper-bound 
the  sample  complexity  of  our  compressed  tomography  scheme.  (That  is,  we  bound  the  errors 
due  to  estimating  each  Pauli  expectation  value  from  a  finite  number  of  experiments.)  Roughly 
speaking,  we  show  that  our  scheme  uses  O(r^d^logd)  copies  to  characterize  a  rank-r  state 
(up  to  constant  error  in  trace  norm).  When  r  =  d,  this  agrees  with  the  sample  complexity  of 
full  tomography.  Our  proof  assumes  a  binomial  noise  model,  but  minor  modifications  could 
extend  this  result  to  other  relevant  noise  models,  such  as  multinomial,  Gaussian  or  Poissonian 
noise. 

Furthermore,  we  show  an  information-theoretic  lower  bound  for  tomography  of  rank-r 
states  using  adaptive  sequences  of  single-copy  Pauli  measurements:  at  least  (r^d^/log  d) 
copies  are  needed  to  obtain  an  estimate  with  constant  accuracy  in  the  trace  distance.  This 
generalizes  a  result  from  [33]  for  pure  states.  Therefore,  our  upper  bound  on  the  sample 
complexity  of  compressed  tomography  is  nearly  tight,  and  compressed  tomography  nearly 
achieves  the  optimal  sample  complexity  among  all  possible  methods  using  Pauli  measurements. 

Our  observation  that  incomplete  sets  of  observables  are  often  sufficient  to  unambiguously 
specify  a  state  gives  rise  to  a  new  degree  of  freedom  when  designing  experiments:  when  aiming 
to  reduce  statistical  noise  in  the  reconstruction,  one  can  either  estimate  a  small  set  of  observables 
relatively  accurately  or  else  a  large  (e.g.  complete)  set  of  observables  relatively  coarsely.  Our 
bounds  (as  well  as  our  numerics)  show  that,  remarkably,  over  a  very  large  range  of  m  the  only 
quantity  relevant  for  the  reconstruction  error  is  t,  the  total  number  of  experiments  performed. 
It  does  not  matter  over  how  many  observables  the  repetitions  are  distributed.  Thus,  fixing  t 
and  varying  m,  the  reduction  in  the  number  of  observables  and  the  increase  in  the  number  of 
measurements  per  observable  have  no  net  effect  with  regard  to  the  fidelity  of  the  estimate  as  long 
asni^  Q  {rd  poly  log  d).  This  finding  holds  independently  of  whether  fidelity  or  trace  distance 
is  used  to  measure  the  reconstruction  error,  and  we  believe  that  it  is  plausible  that  a  similar 
phenomenon  occurs  for  other  cost  functions  as  well. 
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Certification.  We  generalize  the  technique  of  direct  fidelity  estimation  (DFE)  [33,  34]  to  work 
with  low-rank  states.  Thus,  one  can  use  compressed  tomography  to  get  an  estimated  density 
matrix  p,  and  use  DFE  to  check  whether  p  agrees  with  the  true  state  p.  This  check  is  guaranteed 
to  be  sound  even  if  the  true  state  p  is  not  approximately  low  rank.  Our  extension  of  DFE  may  be 
of  more  general  interest,  since  it  can  be  used  to  efficiently  certify  any  estimate  p  regardless  of 
whether  it  was  obtained  using  compressed  sensing  or  not,  as  long  as  the  rank  r  of  the  estimate 
is  small  (and  regardless  of  the  ‘true’  rank). 

Numerical  simulations.  We  compare  the  performance  of  several  different  estimators  (methods 
for  reconstructing  the  unknown  density  matrix).  They  include:  constrained  trace-minimization 
(also  known  as  the  matrix  Dantzig  selector),  least  squares  with  trace-norm  regularization  (or  the 
matrix  Lasso),  as  well  as  a  standard  MLE  [35-37]  for  comparison. 

We  observe  that  our  estimators  outperform  MLE  in  essentially  all  aspects,  with  the  matrix 
Lasso  giving  the  best  results.  The  fidelity  of  the  estimate  is  consistently  higher  using  the 
compressed  tomography  estimators.  Also,  the  accuracy  of  the  compressed  sensing  estimates 
is  (as  mentioned  above)  fairly  insensitive  to  the  number  of  measurement  settings  m  (assuming 
that  the  total  time  available  to  run  the  experiment  is  fixed).  So  by  choosing  m  d^,  one  still 
obtains  accurate  estimates,  but  with  much  faster  classical  post-processing,  since  the  size  of  the 
data  set  scales  as  0{m)  rather  than  0{d^). 

It  may  be  surprising  to  the  reader  that  we  outperform  MLE,  since  it  is  often  remarked 
(somewhat  vaguely)  that  ‘MLE  is  optimal’.  However,  MLE  is  a  general-purpose  method  that 
does  not  specifically  exploit  the  fact  that  the  state  is  low-rank.  Also,  the  optimality  results  for 
MLE  only  hold  asymptotically  and  for  informationally  complete  measurements  [38,  39];  for 
finite  data  [40]  or  for  incomplete  measurements,  MLE  can  be  far  from  optimal. 

Lrom  these  results,  one  can  extract  some  lessons  about  how  to  use  compressed  tomography. 
Compressed  tomography  involves  two  separate  ideas:  (1)  measuring  an  incomplete  set  of 
observables  (i.e.  choosing  m  d^)  and  (2)  using  trace  minimization  or  regularization  to 
reconstruct  low-rank  solutions.  Usually,  one  does  both  of  these  things.  Now,  suppose  that  the 
goal  is  to  reconstruct  a  low-rank  state  using  as  few  samples  as  possible.  Our  results  show  that 
one  can  achieve  this  goal  by  doing  (2)  without  (1).  At  the  same  time,  there  is  no  penalty  in 
the  quality  of  the  estimate  when  doing  (1),  and  there  are  practical  reasons  for  doing  it,  such  as 
reducing  the  size  of  the  data  set  to  speed  up  the  classical  post-processing. 

Quantum  process  tomography.  Linally,  we  adapt  our  method  to  perform  tomography  of 
processes  with  a  small  Kraus  rank.  Our  method  is  easy  to  implement,  since  it  requires  only  the 
ability  to  prepare  eigenstates  of  Pauli  operators  and  measure  Pauli  observables.  In  particular, 
we  require  no  entangling  gates  or  ancillary  systems  for  the  procedure.  In  contrast  with  [19], 
our  method  is  not  restricted  to  processes  that  are  element-wise  sparse  in  some  known  basis,  as 
discussed  in  section  6.2.  This  is  an  important  advantage  in  practice,  because  while  the  ideal  (or 
intended)  evolution  of  a  system  may  be  sparse  in  a  known  basis,  it  is  often  the  case  that  the 
noise  processes  perturbing  the  ideal  component  are  not  sparse,  and  knowledge  of  these  noise 
processes  is  key  to  improving  the  fidelity  of  a  quantum  device  with  some  theoretical  ideal. 

1.1.  Related  work 

While  initial  work  on  tomography  considered  only  linear  inversion  methods  [41],  most 
subsequent  works  have  largely  focused  on  maximum  likelihood  methods  and  to  a  lesser  extent 
on  Bayesian  methods  for  state  reconstruction  [35-37,  41-52]. 
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However,  there  has  recently  been  a  flurry  of  work  which  seeks  to  transcend  the  standard 
MLE  methods  and  improve  on  them  in  various  ways.  Our  contributions  can  also  be  seen  in  this 
context. 

One  way  in  which  alternatives  to  MLE  are  being  pursued  is  through  what  we  caW  full-rank 
methods.  Here  the  idea  is  somewhat  antithetical  to  ours:  the  goal  is  to  output  a  full-rank  density 
operator,  rather  than  a  rank  deficient  one.  This  is  desirable  in  a  context  where  one  cannot  make 
the  approximation  that  rare  events  will  never  happen.  Blume-Kohout’s  hedged  MLE  [53]  and 
Bayesian  mean  estimation  [52]  are  good  examples  of  this  type  of  estimator,  as  are  the  minimax 
estimator  of  [54]  and  the  so-called  Max-Ent  estimators  [55-58].  The  latter  are  specifically  for 
the  setting  where  the  measurement  data  are  not  informationally  complete,  and  one  tries  to 
minimize  the  bias  of  the  estimate  by  maximizing  the  entropy  along  the  directions  where  one 
has  no  knowledge. 

By  contrast,  our  low-rank  methods  do  not  attempt  to  reconstruct  the  complete  density 
matrix,  but  only  a  rank-r  approximation,  which  is  accurate  when  the  true  state  is  close  to  low- 
rank.  Erom  this  perspective,  our  methods  can  be  seen  as  a  sort  of  Occam’s  razor,  using  as  few 
fit  parameters  as  possible  while  still  agreeing  with  the  data  [59].  Eurthermore,  as  we  show  here 
and  elsewhere  [17],  informationally  incomplete  measurements  can  still  provide  faithful  state 
reconstructions  up  to  a  small  truncation  error. 

One  additional  feature  of  our  methods  is  that  we  are  deeply  concerned  with  the  feasibility 
of  our  estimators  for  a  moderately  large  number  of  qubits  (say,  10-15).  In  contrast  to  most 
of  the  existing  literature,  we  adopt  the  perspective  that  it  is  not  enough  for  an  estimator  to  be 
asymptotically  efficient  in  the  number  of  copies  for  fixed  d.  We  also  want  the  scaling  with 
respect  to  d  to  be  optimal.  We  specifically  take  advantage  of  the  fact  that  many  states  and 
processes  are  described  by  low-rank  objects  to  reduce  this  complexity.  In  this  respect,  our 
methods  are  similar  to  tomographic  protocols  that  are  tailored  to  special  ansatz  classes  of  states, 
such  as  those  recently  developed  for  use  with  permutation-invariant  states  [60],  matrix  product 
states  [61]  or  multi-scale  entangled  states  [62]. 

Our  error  bounds  are  somewhat  unique  as  well.  Most  previous  work  on  error  bounds 
used  either  standard  resampling  techniques  or  Bayesian  methods  [42,  43,  45,  48,  50,  52]. 
Very  recently,  Christandl  and  Renner  [63]  and  Blume-Kohout  [64]  independently  derived  two 
closely  related  approaches  for  obtaining  confidence  regions  that  satisfy  or  nearly  satisfy  certain 
optimality  criteria.  In  particular,  the  latter  approaches  can  give  very  tight  error  bounds  on  an 
estimate,  but  they  can  be  computationally  challenging  to  implement  for  large  systems.  The  error 
bounds  which  most  closely  resemble  ours  are  of  the  ‘large  deviation  type’;  see,  for  example,  the 
discussion  in  [38].  This  is  true  for  the  new  improved  error  bounds,  as  well  as  the  original  bounds 
proven  in  [17,  18].  These  types  of  bounds  are  much  easier  to  calculate  in  practice,  which  agrees 
with  our  philosophy  on  computational  complexity,  but  may  be  somewhat  looser  than  the  optimal 
error  bounds  obtainable  through  other  more  computationally  intensive  methods  such  as  those 
of  [63,  64]. 

1.2.  Notation  and  outline 

We  denote  Pauli  operators  by  P  or  P,.  We  define  [n]  =  (l,...,n}.  The  norms  we  use 
are  the  standard  Euclidean  vector  norm  ||  x  II2,  the  Erobenius  norm  ||X||f  =  ^Tx{X"'X),  the 
operator  norm  ||X ||  =  yAmax(V''*’2f)  and  the  trace  norm  ||X Hq.  =  Tr|X|,  where  |X|  =  ^JX^'X. 
The  unknown  ‘true’  state  is  denoted  by  p  and  any  estimators  for  p  are  given  a  hat:  p.  The 
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expectation  value  of  a  random  variable  X  is  denoted  by  E  X.  We  denote  by  the  set  oid  xd 
Hermitian  matrices. 

The  paper  is  organized  as  follows.  In  section  2,  we  detail  the  estimators  and  error  bounds 
and  then  upper  bound  the  sample  complexity.  In  section  3,  we  derive  lower  bounds  on  the 
sample  complexity.  In  section  4,  we  find  an  efficient  method  of  certifying  the  state  estimate.  In 
section  5,  we  detail  our  numerical  investigations.  We  show  how  our  scheme  can  be  applied  to 
quantum  channels  in  section  6  and  conclude  in  section  7. 

2.  Theory 

We  describe  our  compressed  tomography  scheme  in  detail.  Firstly,  we  describe  the  measurement 
procedure  and  the  method  for  reconstructing  the  density  matrix.  Then  we  prove  error  bounds 
and  analyze  the  sample  complexity. 

2.1.  Random  Pauli  measurements 

Consider  a  system  of  n  qubits,  and  let  d  =  2".  Let  V  be  the  set  of  all  d^  Pauli  operators,  i.e. 
matrices  of  the  form  P  =  ay®  ■  ■  ■  ®  On  where  ct,  g  {/,  o"',  o' ,  a^}. 

Our  tomography  scheme  works  as  follows.  Firstly,  choose  m  Pauli  operators,  P\,  . . . ,  Pm, 
by  sampling  independently  and  uniformly  at  random  from  V.  (Alternatively,  one  can  choose 
these  Pauli  operators  randomly  without  replacement  [65],  but  independent  sampling  greatly 
simplifies  the  analysis.)  We  will  use  t  copies  of  the  unknown  quantum  state  p.  For  each  i  G  [m], 
take  t/m  copies  of  the  state  p,  measure  the  Pauli  observable  P,  on  each  one  and  average  the 
measurement  outcomes  to  obtain  an  estimate  of  the  expectation  value  Tr(P,  p).  (We  will  discuss 
how  to  set  m  and  t  later.  Intuitively,  to  learn  a  d  x  d  density  matrix  with  rank  r,  we  will  set 
m  rd  log®  d  and  t  ~  r^d^  log  d.) 

To  state  this  more  concisely,  we  introduce  some  notation.  Define  the  sampling  operator  to 
be  a  linear  map  A:  ^  M"'  defined  for  all  i  G  [m]  by 

{A{p)),  =  J-TriPp).  (1) 

V  m 

(The  normalization  is  chosen  so  that  E  A* A  =  J,  where  J  denotes  the  identity  superoperator, 
and  A*  is  the  adjoint  of  A.  That  is.  A*  :  ^  is  the  complex-conjugate  transpose  of  the 

linear  map  A.)  We  can  then  write  the  output  of  our  measurement  procedure  as  a  vector 

y  =  Aip)  +  z,  (2) 

where  z  represents  statistical  noise  due  to  the  finite  number  of  samples.  (More  generally,  if  the 
measurements  contain  systematic  or  adversarial  noise,  these  can  also  be  represented  by  z.  There 
are  various  error  bounds  for  this  situation,  although  they  are  necessarily  more  pessimistic  than 
the  ones  we  show  here  [66].) 

2.2.  Reconstructing  the  density  matrix 

We  now  show  two  methods  for  estimating  the  density  matrix  p.  Both  are  based  on  the  same 
intuition:  find  a  matrix  X  G  that  fits  the  data  y  while  minimizing  the  trace  norm  ||X||tr, 
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which  serves  as  a  surrogate  for  minimizing  the  rank  of  X.  In  both  cases,  this  amounts  to  a 
convex  program,  which  can  be  solved  efficiently. 

(We  mention  that  at  this  point  we  do  not  require  that  our  density  operators  are  normalized 
to  have  unit  trace.  We  will  return  to  this  point  later  in  section  5.) 

The  first  estimator  is  obtained  by  constrained  trace-minimization  (also  known  as  the  matrix 
Dantzig  selector): 

Pds  =  arg  nun  II X II  tr  s.t.  M*  (-4(X)  -  y )  ||  ^  A .  (3) 

Note  that  the  constraint  bounds  the  operator  norm  of  A*{A{X)  —  y).  The  parameter  A  should  be 
set  according  to  the  amount  of  noise  in  the  data  y ;  we  will  discuss  this  later. 

The  second  estimator  is  obtained  by  least-squares  linear  regression  with  trace-norm 
regularization  (also  known  as  the  matrix  Lasso): 

PLasso  =  argimniM(X)  -y||2-i-/x||X||tr.  (4) 

Again  the  regularization  parameter  /x  should  be  set  according  to  the  noise  level;  we  will  discuss 
this  later. 

One  additional  point  is  that  we  do  not  require  positivity  of  the  output  in  our  definition  of  the 
estimators  (3)  and  (4).  One  can  add  this  constraint  (since  it  is  convex)  and  the  conclusions  below 
remain  unaltered.  We  will  explicitly  add  positivity  later  on  when  we  do  numerical  simulations, 
and  discuss  any  trade-offs  that  result  from  this. 

2.3.  Error  bounds 

In  previous  works  on  compressed  tomography  [17,  18],  error  bounds  were  proved  by 
constructing  a  ‘dual  certificate’  [67]  (using  convex  duality  to  characterize  the  solution  to  the 
trace-minimization  convex  program).  Here  we  derive  stronger  bounds  using  a  different  tool, 
known  as  the  RIP.  The  RIP  for  low -rank  matrices  was  first  introduced  in  [32],  and  was  recently 
shown  to  hold  for  random  Pauli  measurements  [30]. 

We  say  that  the  sampling  operator  A  satisfies  the  rank-r  RIP  if  there  exists  some  constant 
0  ^  <  1  such  that,  for  all  X  e  with  rank  r, 

(1-5,)||X||f^  M(X)||2^(1  +  5.)||X||f.  (5) 

For  our  purposes,  we  can  assume  that  X  is  Hermitian.  Note  that  this  notion  of  RIP  is  analogous 
to  that  used  in  [19],  except  that  it  holds  over  the  set  of  low-rank  matrices,  rather  than  the  set  of 
sparse  matrices. 

The  random  Pauli  sampling  operator  (1)  satisfies  RIP  with  high  probability,  provided  that 
m  ^  Crdlog^d  (for  some  absolute  constant  C);  this  was  recently  shown  in  [30].  We  note, 
however,  that  this  RIP  result  requires  m  to  be  larger  by  a  poly  log  d  factor  compared  to  previous 
results  based  on  dual  certificates.  Although  m  is  slightly  larger,  the  advantage  is  that  when 
combined  with  the  results  of  [31],  this  immediately  implies  strong  error  bounds  for  the  matrix 
Dantzig  selector  and  the  matrix  Lasso. 

To  state  these  improved  error  bounds  precisely,  we  introduce  some  definitions.  For  the  rest 
of  section  2,  let  C,  Co,  C\,  Cq  and  C[  be  fixed  absolute  constants.  For  any  quantum  state  p,  we 
write  p  =  p^  +  Pp,  where  p^  is  the  best  rank-r  approximation  to  p  (consisting  of  the  largest  r 
eigenvalues  and  eigenvectors),  and  pc  is  the  residual  part.  Now  we  have  the  following: 
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Theorem  1,  Let  A  be  the  random  Pauli  sampling  operator  (1)  with  m^  Crd  log®  d.  Then,  with 
high  probability,  the  following  holds: 

Let  Pds  be  the  matrix  Dantzig  selector  (3),  and  choose  X  so  that  ||^*(z)||  ^  X.  Then 
IIPdS  —  plltr  ^  CorX  +  Cl  lIPcIltr- 

Alternatively,  let  PLasso  be  the  matrix  Lasso  (4),  and  choose  pL  so  that  ||.4,*(z)||  ^  pl/2.  Then 
II  Pl  asso  Plltr  ^  Cor/X  +  Ci||Pc||tr- 

In  these  error  bounds,  the  first  term  depends  on  the  statistical  noise  z.  This,  in  turn,  depends 
on  the  number  of  copies  of  the  state  that  are  available  in  the  experiment;  we  will  discuss  this  in 
the  next  section.  The  second  term  is  the  rank-r  approximation  error.  It  is  clearly  optimal,  up  to 
a  constant  factor. 

Proof.  These  error  bounds  follow  from  the  RIP  as  shown  by  Theorem  2.1  in  [30],  and  a 
straightforward  modification  of  Lemma  3.2  in  [31]  to  bound  the  error  in  trace  norm  rather  than 
Frobenius  norm  (this  is  similar  to  the  proof  of  Theorem  5  in  [66]). 

The  modification  of  Lemma  3.2  in  [31]  is  as  follows®.  (For  the  remainder  of  this  section, 
equation  numbers  of  the  form  (III.x)  refer  to  [31].)  In  the  case  of  the  Dantzig  selector,  let 
H  =  pps  —  p  be  the  difference  of  the  matrix  Dantzig  selector  and  the  quantum  state.  Following 
equation  (III.8)  and  decomposing  H  =  Hq  +  He  exactly  as  in  [31],  we  can  obtain  the  following 
bound: 

ll^lltr  ^  l|//olltr+  l|//clltr  ^  2 1| //q  II  tr  +  2 1|  P,  |1  tr,  (6) 

where  we  used  the  triangle  inequality  and  equation  (III. 8).  Then,  at  the  end  of  the  proof,  we 
write 

II  //olltr  ^  V^ll  //oIIf  ^  V^ll  Ha  +  H,  IIf 

^Ci4V2rA  +  Ci2V234,||p,||tr,  (7) 

where  we  used  Cauchy-Schwarz,  the  fact  that  Hq  and  Hi  are  orthogonal,  the  bound  on 
I|7/o  +  7/i||f  following  equation  (III.  13)  and  equation  (IIL7).  Combining  (6)  and  (7)  gives  our 
desired  error  bound.  The  error  bound  for  the  Lasso  is  obtained  in  a  similar  way;  see  section 
III.Gin[31].  □ 

2.4.  Sample  complexity 

Here  we  bound  the  sample  complexity  of  our  tomography  scheme;  that  is,  we  bound  the  number 
of  copies  of  the  unknown  quantum  state  p  that  are  needed  to  obtain  our  estimate  up  to  some 
accuracy.  What  we  show,  roughly  speaking,  is  that  t  =  C?((y)^  log  J)  copies  are  sufficient  to 
reconstruct  an  estimate  of  an  unknown  rank  r  state  up  to  accuracy  e  in  the  trace  distance.  For 
comparison,  note  that  when  r  =  d,  and  one  does  full  tomography,  0(d'^fe^)  copies  are  sufficient 
to  estimate  a  full-rank  state  with  accuracy  e  in  trace  distance^. 

^  Note  that  [31]  contains  a  typo  in  Lemma  3.2:  on  the  right-hand  side,  in  the  second  term,  it  should  be  \\Mc\\t/.fr 
and  not  ||Mc|h/r. 

^  To  see  this,  let  Pi,  ... ,  P42  denote  the  Pauli  matrices.  For  each  i  e  [cP},  measure  P,-  0{<P  js^)  times,  to  estimate 
its  expectation  value  with  additive  error  ±0(e/d).  Equivalently,  one  estimates  the  expectation  value  of  P,  /\/d 
with  additive  error  Using  linear  inversion,  one  obtains  an  estimated  density  matrix  with  additive  error 

0(s/d'^^)  in  Frobenius  norm,  which  implies  error  0(e)  in  trace  norm. 
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To  make  this  claim  precise,  we  need  to  specify  how  we  construct  our  data  vector  y  from  the 
measurement  outcomes  on  the  t  copies  of  the  state  p.  For  the  matrix  Dantzig  selector,  suppose 
that 

t  ^  2C4(Corlsfd(d  +  1)  log d  (8) 

for  some  constants  C4  >  1  and  s  ^  Co-  (For  the  matrix  Lasso,  substitute  Cq  for  Co  in  these 
equations.)  We  construct  an  estimate  of  A(p)  as  follows:  for  each  i  e  [m],  we  take  t/m  copies 
of  p,  measure  the  random  Pauli  observable  P,  on  each  of  the  copies  and  use  this  to  estimate 
Tr(P,p).  Then  let  y  be  the  resulting  estimate  of  A(p),  and  let  z  =  y  —  A(p).  Everything  else  is 
defined  exactly  as  in  theorem  1. 

Theorem  2.  Givent  =  O  log  d)  copies  of  p  as  in  equation  (8)  and  measured  as  discussed 
above,  the  following  holds  with  high  probability  over  the  measurement  outcomes. 

Let  yOos  be  the  matrix  Dantzig  selector  (3),  and  set  X  =  ej (Cor)  for  some  e  >  0.  Then 

IIPdS  ~  plltr  ^  e  +  Cl  lIPcIltr- 

Alternatively,  let  PLasso  he  the  matrix  Lasso  (4),  and  set  /x  =  s/(Cor).  Then 
II  PLasso  P  II  tr  ^  ^  C  f^Pc  II  tr- 


Proof.  Our  claim  reduces  to  the  following  question:  if  we  fix  some  value  of  k  >  0,  how  many 
copies  of  p  are  needed  to  ensure  that  the  measurement  data  y  satisfy  ||.4*(y  —  .4(p))||  ^  k? 
Then  one  can  apply  theorem  1  to  obtain  an  error  bound  for  our  estimate  of  p. 

Let  t  be  the  number  of  copies  of  p.  Say  we  fix  the  measurement  operator  A,  i.e.  we  fix 
the  choice  of  the  Pauli  observables  Pu  ■  ■  ■ ,  Pm-  (The  measurement  outcomes  are  still  random, 
however.)  For  /  e  [m]  and  7  G  [f/m],let  Bij  G  {1,  —  1}  be  the  outcome  of  the  jth  repetition  of  the 
experiment  that  measures  the  /th  Pauli  observable  P,  .  Note  that  E  Bij  =  Tr(P,p).  Then  construct 
the  vector  y  g  containing  the  estimated  expectation  values  (scaled  by  ^d/m)\ 


t  jm 


'  d  171  \  y 

■j^Bij,  ie[m]. 

7=1 


(9) 


Note  that  E  y  =  A{p). 

We  will  bound  the  deviation  ||.4*(y  —  .4.(p))||,  using  the  matrix  Bernstein  inequality.  First, 
we  write 


^*(3^)  = 


m  t/m 

YALrn,,. 


1=1  7=1 


(10) 


and  also 


7 

A*A{p)  =  -TP{lAPip).  (11) 

m 

i  =  l 

We  can  now  write  A^iy  —  A{p))  as  a  sum  of  independent  (but  not  identical)  matrix-valued 
random  variables: 

m  t/m  ^ 

A*iy-Aip))  =  P,[B„-Tr(P,p)].  (12) 

1=1  j=i 
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Note  that  EX,-^  =  0  and  ||X,y||  ^  Idjt  =:  R.  Also,  for  the  second  moment  we  have 
E(X2)  =  E  -Tr(P,p)]'^  =  ^/[l  -Tr(P,p)2]. 


Then  we  have 


II  ^2  ^2  //2 

-  I E  E  (4)  I  =  E  ^ [1  -  P)1  ^  ^  ^  =  7- 

u 


(13) 


(14) 


Now  the  matrix  Bernstein  inequality  (Theorem  1.4  in  [68])  implies  that 
Pr[||.4-(,  -  .4(p))||  >  X]  <  J .  exp  '  “P  (-5^) 

(where  we  assumed  that  X  ^  1). 

For  the  matrix  Dantzig  selector,  we  set  A  =  e/(Cor),  and  we  get  that,  for  any  t  ^ 
2CA,X~^d{d  +  l)log<i  =  ICniCor  /  s^did  +  l)log(i. 


Pr 


M*(y-A(p))|| 


e 

Cor  _ 


^  r/ •exp(— C4log(i)  =  r/' 


(16) 


which  is  exponentially  small  in  C4.  Plugging  into  theorem  2  completes  the  proof  of  our  claim. 
A  similar  argument  works  for  the  matrix  Lasso.  □ 


3.  Lower  bounds 

How  good  are  the  sample  complexities  of  our  algorithms?  Here  we  go  a  long  way  toward 
answering  this  question  by  proving  nearly  tight  lower  bounds  on  the  sample  complexity  for 
generic  rank  r  quantum  states  using  single-copy  Pauli  measurements.  Previous  work  on  single¬ 
copy  lower  bounds  has  only  treated  the  case  of  pure  states  [33]. 

Roughly  speaking,  we  show  the  following,  which  we  will  make  precise  at  the  end  of  the 
section. 

Theorem  3  (Imprecise  formulation).  The  number  of  copies  t  needs  to  grow  at  least  as  fast  as 
Clij'^d'^  j  log  d)  to  guarantee  a  constant  trace-norm  confidence  interval  for  all  rank-r  states. 

The  argument  proceeds  in  three  steps.  First,  we  fix  our  notion  of  risk  to  be  the  minimax  risk, 
meaning  we  seek  to  minimize  our  worst-case  error  according  to  some  error  metric  such  as  the 
trace  distance.  We  want  to  know  how  many  copies  of  the  unknown  state  we  need  to  make 
this  minimax  risk  an  arbitrarily  small  constant.  For  a  fixed  set  of  two-outcome  measurements, 
say  Pauli  measurements,  we  then  show  that  some  states  require  many  copies  to  achieve 
this.  In  particular,  these  states  have  the  property  that  they  are  globally  distinguishable  (their 
trace  distance  is  bounded  from  below  by  a  constant)  but  their  (Pauli)  measurement  statistics 
look  approximately  the  same  (each  measurement  outcome  is  close  to  unbiased).  The  more 
such  states  there  are,  the  more  copies  we  need  to  distinguish  between  them  all  using  solely 
Pauli  measurements.  Finally,  we  use  a  randomized  argument  to  show  that,  in  fact,  there  are 
exponentially  many  such  states.  This  yields  the  desired  lower  bound  on  the  sample  complexity. 

Let  S  be  some  set  of  density  operators.  We  want  to  put  lower  bounds  on  the  performance 
of  any  estimation  protocol  for  states  in  S .  (We  do  not  initially  restrict  ourselves  to  states  with 
low  rank.) 
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We  assume  that  the  protocol  has  access  to  t  copies  of  an  unknown  state  p  G  S  on  which  it 
makes  measurements  one  by  one.  At  the  ith  step,  it  has  to  decide  which  observable  to  measure. 
Let  us  restrict  ourselves  to  two-outcome  measurements,  which  can  be  described  by  positive 
operator-valued  measures  (POVMs)  {ff,,  1  —  O,},  where  each  O,  satisfies  0^0,^!  and  may 
be  chosen  from  a  set  V.  (We  do  not  initially  restrict  ourselves  to  Pauli  measurements,  either.) 
We  allow  the  choice  of  the  ith  observable  to  depend  on  the  previous  outcomes.  We  refer  to 
the  random  variables  which  jointly  describe  the  choice  of  the  /th  measurement  and  its  random 
outcome  as  F,-.  At  the  end,  these  are  mapped  to  an  estimate  p(Fi,  . . . ,  F,)  G  S. 

In  other  words,  an  estimation  protocol  is  specified  by  a  set  of  functions 


n,- :  Fi, . . . ,  F_1  ^  iP, 
p:Fi,...,L^S. 


Consider  a  distance  measure  A  :  S  x  S  — ^  M  on  the  states  in  S.  (For  example,  this  could 
be  the  trace  distance  or  the  infidelity;  we  need  not  specify.)  Suppose  that  we  are  given  t  copies 
of  the  unknown  state  p,  and  the  maximum  deviation  we  wish  to  tolerate  between  our  estimate 
p  and  the  true  state  p  is  given  by  e  >  0.  Now  define  the  minimax  risk 


M*(t,e)=  inf  sup  PrrA(p(F),  p)  >  el, 

(P.n,)  peE 


(17) 


where  the  infimum  is  over  all  estimation  procedures  (p,  fl,)  on  t  copies  with  estimator  p  and 
choice  of  observables  given  by  ff,.  That  is,  we  are  considering  the  ‘best’  protocol  to  be  the  one 
whose  worst-case  probability  of  deviation  is  the  least. 

The  next  lemma  shows  that  if  there  are  a  large  number  of  states  in  S  which  are  far  apart  (by 
at  least  e)  and  whose  statistics  look  nearly  random  for  all  measurements  in  V,  then  the  minimax 
risk  is  lower-bounded  by  a  function  that  decreases  linearly  with  the  number  of  copies  t.  Hence, 
to  achieve  a  small  minimax  risk,  the  number  of  copies  t  must  be  large. 

Lemma  1.  Assume  that  there  are  states  pi,  . . . ,  Pj  C  S  such  that 


V/  /  j  :  A(pi,  pj)  ^  e, 
V/,VnGT’:|trp,n-iKQ!. 


Then  the  minimax  risk  as  defined  in  (17)  fulfills 


log  5 


Proof.  Let  X  be  a  random  variable  taking  values  uniformly  in  [5].  Let  Fi,  . . . ,  F,  be  random 
variables,  F,  describing  the  outcome  of  the  ith  measurement  carried  out  on  px-  Define 


X(F)  :=  argmin  A(p(F),  p,). 


Then 


Pr[A(p(F),  p,)  >e\7>  Pr[i(F)  /  X]. 


(18) 


Now  combine  Fano’s  inequality 


HiX\X)  ^  l-FPr[XytX]log5, 
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the  data  processing  inequality 

IiX-,XiY))^l{X-,Y), 

in  terms  of  the  mutual  information  /  (X :  Y)  :=  H{X)  — //(X|F),  and  the  fact  that //(X)  =  log5 
to  obtain 


Pr[X(F)  j^X]^ 


=  1 


^  1 


=  1 


H(X\X)-\ 

logs 

//(X)-/(X;X)-1 
log  5 

/(X;  X)  +  l 


>  1  - 


log  5 

/(X;  F)  +  l 
log  5 

H{Y)-H{Y\X)  +  l 
logs 


log  5 


Let  h{p)  be  the  binary  entropy  and  recall  the  standard  estimate 
h{l/l±a)  ^  il-4a^). 

Combine  that  with  the  chain  rule  [69,  Theorem  2.5.1] 

t 

H(Y\X  =  i)  =  H{Yj\Yj.i,  . . . ,  Fi,  X  =  /) 

1=1 

^  til-4a^). 

The  advertised  bound  follows. 


□ 


We  now  show  how  to  construct  a  set  of  states,  each  with  rank  r,  that  satisfy  the  conditions  of 
lemma  1  in  terms  of  the  trace  distance  and  the  set  of  Pauli  measurements.  The  following  lemma 
shows  how,  given  such  a  set  of  states,  we  can  enlarge  it  by  one.  We  will  then  apply  this  lemma 
repeatedly,  to  get  a  total  of  5  <  states. 


Lemma  2.  For  any  0  <  e  <  1  —  let  pi,  ps  be  normalized^  rank-r  projections  on  C‘^, 
where  s  <  and  c(e)  is  specified  below.  Then  there  exists  a  normalized  rank-r  projection 
p  such  that 


Vi  G  [si  :  fllp-p; 


5K0;. 


VP,7tl:|tr[i(l±P,)p] 

Here,  the  Pk  are  n -qubit  Pauli  operators, 

ln(8/7r)  ,  1  r/,  _  _ 

'Lv  d) 


c{e)  = 


2rd 


1 

-  H - 

32 


(19) 

(20) 


Normalized  to  have  trace  1. 
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Proof.  Let  po  be  some  normalized  rank-r  projection  and  choose  p  according  to^ 
p  =  OpoO^ 

for  a  Haar-random  O  G  SO{d).  Here  we  use  the  special  orthogonal  group  SO(d)  because  the 
analysis  becomes  marginally  simpler  than  if  we  use  a  unitary  group. 

To  check  (19),  choose  i  G  [5]  and  define  /?,  to  be  the  projector  onto  the  range  of  p,  .  Also 
define  the  function 

f:Ot^\\p,-OpoO^\W. 

We  can  bound  the  magnitude  of  /  using  the  pinching  inequality: 

fiO)  ^  \\pi-RiOpoO^ RAL+\\RfOpoO^ Rf\W  >  ti{pt)-ti{RtOpoO^Ri)  +  HRfOpoO^Rf) 
=  \+tr[(Rf-R.)OpoO^]. 

From  this  we  can  bound  the  expectation  value  of  /  over  the  special  orthogonal  group: 

E  [f(0)]  ^  l+tr[(Rl-  -  Ri)(E  OpO^)] 

=  1  +  ^ti[iRl-  -  /?,)!] 


=  1  + 


d  —  2r 


Next  we  get  an  upper  bound  of  ^  on  the  Lipschitz  constant  of  /  with  respect  to  the  Frobenius 
norm: 

\f(0  +  A)-f{0)\  ^  \\iO  +  A)po{0  +  AV -OpoO^K 

^  ||OpoA^|L+||ApoO''||,..+  ||ApoA''||,, 

=  2||Apo||tr  +  tr(ApoA^) 

^  2Vt||  Apo||F  +  tr(poAA^) 

^2v^||A||f||po||  +  -||A||||A''||,, 
r 

^^||A||f+-V^||A||f^^||A||f, 
r  Vr 

where  we  use  ||  A  ||^  2  in  the  last  line,  which  follows  from  the  triangle  inequality  and  the  fact 
that  any  A  can  be  written  as  a  difference  A  =  O'  —  O  for  O'  G  SO{d). 

From  these  ingredients,  we  can  invoke  Levy’s  lemma  on  the  special  orthogonal  group 
[70,  Theorem  6.5.1]  to  obtain  that,  for  all  f  >  0, 


Pr[llPi  -  Plltr  <  2(1  -  r/d)  -  f]  ^  e‘' 


16 

where  the  constants  are  given  by  ci  =  ln(>/7r/8)  and  C2  =  1  /8.  Now  choose  t  =  2(1  —  r/d)  —  2e 
and  apply  the  union  bound  to  obtain 

Pr[(19)  does  not  hold]  <  Pr[||p,  —  p||tr  <  2(1  —  r/J)  —  f] 

/2- 


^  exp  (^rd  I 


Cjt 


+  Ci  )  — 


For  the  duration  of  this  proof,  the  letter  O  denotes  an  element  of  SO{d}  instead  of  the  asymptotic  big-0  notation. 
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The  upper  bound  on  e  follows  from  the  requirement  that  t  >  0.  This  shows  that  p  indeed  satisfies 
equation  (19). 

Now  we  move  on  to  equation  (20).  For  any  non-identity  Pauli  matrix  define  a  function 

Clearly,  we  have  E  [/(O)]  =  0.  We  again  wish  to  bound  the  rate  of  change  so  that  we  can  use 
Levy’s  lemma,  so  we  compute 

(do/)  (A)  =  tripoO^P,A)  +  triP,OpoA^)  =  tr[(poO^  Pt  + p^  0^P[)A], 

which  implies  that 

II  V/((9)||2  =  \\PoO^Pk  +  P^ O^P^ IIf  ^ 


Levy’s  lemma  then  gives  for  all  f  >  0 

Pr[|trP<,p|  >  f]  ^  e‘'  exp^ — ^ 

Choosing  t  =  2a,  and  a^  =  4\n{d‘^7t /S) / (rd),  then  the  union  bound  gives  us 

C2i2a)^rd 


Pr[(20)  does  not  hold]  <  d^Px\\tiPkp\  >  2a]  ^  exp^21n(i 


+  ci^  —  L 


from  which  the  lemma  follows. 


□ 


We  remark  that  a  version  of  lemma  2  continues  to  hold  even  if  we  can  adaptively  choose 
from  as  many  as  2^^"^  additional  measurements  which  are  globally  unitarily  equivalent  to  Pauli 
measurements. 

We  now  combine  the  two  previous  lemmas,  to  get  a  precise  version  of  theorem  3.  (To 
summarize:  we  use  lemma  2  repeatedly  to  construct  a  large  set  of  states  that  are  difficult  to 
distinguish  using  Pauli  measurements;  then  we  use  lemma  1  to  lower-bound  the  minimax  risk 
for  estimating  an  unknown  state  using  t  copies;  hence,  if  an  estimation  procedure  has  a  small 
minimax  risk,  t  must  be  large.) 


Theorem  4  (Precise  version  of  theorem  3).  Fix  e  e  (0,  1  —  ^)  and  8  e  [0,  1).  Then,  in  order 
for  an  estimation  procedure  to  satisfy  M*(t,  e)  ^  8,  the  number  of  copies  t  must  grow  like 


where  the  implicit  constant  depends  on  8  and  e. 


4.  Certifying  the  state  estimate 

Here  we  sketch  how  the  technique  of  DFE,  introduced  in  [33,  34]  for  pure  states,  can  be  used  to 
estimate  the  fidelity  between  a  low-rank  estimate  p  and  the  true  state  p.  The  only  assumption  we 
make  is  that  yO  is  a  positive  semidefinite  matrix  with  Tr(p)  ^  1  and  rank(p)  =  r.  No  assumption 
at  all  is  needed  on  p.  In  fact,  we  also  do  not  assume  that  we  obtained  the  estimate  p  from 
any  of  the  estimators  which  were  discussed  previously.  Our  certification  procedure  works 
regardless  of  how  one  obtains  p,  and  so  it  even  applies  to  the  situation  where  p  was  chosen 
from  a  subset  of  variational  ansatz  states,  as  in  [61]. 
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Recall  that  the  main  idea  of  DFE  is  to  take  a  known  pure  state  \  and  from  it  define  a 
probability  distribution  Pr(/)  such  that  by  estimating  the  Pauli  expectation  values  Tr(pP,  )  and 
suitably  averaging  it  over  Pr(/)  we  can  learn  an  estimate  of  (i/^ |p|i/^).  In  fact,  one  does  not  need 
to  do  a  full  average;  simply  sampling  from  the  distribution  a  few  times  is  sufficient  to  produce  a 
good  estimate. 

For  non-Hermitian  rank-1  matrices  \4>j){^k\  instead  of  pure  states,  we  need  a  very  slight 
modification  of  DFE.  Following  the  notation  in  [33],  we  simply  redefine  the  probability 
distribution  as  'Pr{i)  =  \{^j\Pi\(pk)\^ /d  and  the  random  variable  X  =  Tr(P, p)/ {(/><. |P,  10^).  It 
is  easy  to  check  that  IE(X)  =  ((/>y|p|(/>jr)  and  the  variance  of  X  is  at  most  one.  Then  all  of 
the  conclusions  in  [33,  34]  hold  for  estimating  the  quantity  ((/>y|p|(/>*:).  In  particular,  we  can 
obtain  an  estimate  to  within  an  error  ±c  with  probability  1  —  (5  by  using  only  single-copy  Pauli 
measurements  and  0(^-1-  copies  of  p. 

Our  next  result  shows  that  by  obtaining  several  such  estimates  using  DFE,  we  can  also  infer 
an  estimate  of  the  mixed  state  fidelity  between  an  unknown  state  p  and  a  low-rank  estimate  p, 
given  by 

F(p,  p)  =  [TrVG]^  (21) 

where  for  brevity  we  define  G  =  (Note  that  some  authors  use  the  square  root  of  this 

quantity  as  the  fidelity.  Our  definition  matches  [33].)  The  asymptotic  cost  in  sample  complexity 
is  far  less  than  the  cost  of  initially  obtaining  the  estimate  p  when  r  is  sufficiently  small  compared 
to  d. 

Theorem  5,  Given  a  state  estimate  p  with  rank(p)  =  r,  the  number  of  copies  t  of  the  state  p 
required  for  estimating  F{p,p)  to  within  ±e  with  probability  1—5  using  single-copy  Pauli 
measurements  satisfies 

l  =  o(^^[dlog(rV5)  +  rV5]^.  (22) 

Proof.  The  result  uses  the  DEE  protocol  of  [33,  34],  modified  as  mentioned  above,  where  the 
states  \(j)j)  are  the  eigenstates  of  p. 

Expand  p  in  its  eigenbasis  as  p  =  Then  we  have 

r 

G  =  ^  y^(0,|p|0r)l0i>(</>r|.  (23) 

M=i 

Eor  all  1  ^  7  ^  k  ^  r,  we  use  DFE  to  obtain  an  estimate  gjk  of  the  matrix  element 
{4>j\p\4>k},  up  to  some  additive  error  Cjk  that  is  bounded  by  a  constant,  \€jk\  ^  Cq. 

If  each  estimate  is  accurate  with  probability  I  —2S/(r^  -\-r),  then  by  the  union  bound  the 
probability  that  they  are  all  accurate  is  at  least  1—5.  The  total  number  of  copies  t  required  for 
this  is 

t  =  o(^'^[dlog(r^/8)  +  r^/8]y  (24) 

Let  gjk  =  gf,  and  let 

r 

G=Y^  ./)^kgjk\4>Mk\  (25) 

j.k=\ 
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be  our  estimate  for  G.  Finally,  let  =  [G]+  be  the  positive  part  of  the  Hermitian  matrix  G, 
and  let  F  =  [Tr-v/G+]^  be  our  estimate  of  the  fidelity  F{p,  p).  Note  that  we  may  assume  that 
F  ^  1,  since  if  it  were  larger,  we  can  only  improve  our  estimate  by  just  truncating  it  back  to  1. 

We  now  bound  the  error  of  this  fidelity  estimate.  We  can  write  G  as  a  perturbation  of  G, 
G  =  G  +  E,  where 


r 

E  —  ^  ^  j  j  k\^  j  ){4^k  I  ■ 

i,k=\ 

First  note  that  the  Frobenius  norm  of  this  perturbation  is  small, 

j  /2 

II-E'IIf  =  i^k\G jk?^  ^  =  ^0- 

j.k  i 


(If  p  is  subnormalized,  then  this  last  equality  becomes  an  inequality.) 
Next  observe  that 


|F-F|  = 


Tr( Vg  +  \/^)Tr(VG  -  /&)  I  ^  2 A , 


where  we  define 


(26) 


(27) 


A  =  |Tr(VG-\/^)|. 


(28) 


Using  the  reverse  triangle  inequality,  we  can  bound  A  in  terms  of  the  trace  norm 


A  ^ 


(29) 


Using  [71,  Theorem  X.1.3]  in  the  first  step,  we  find  that 


< 


|G  -G" 


where  the  second  bound  follows  from  the  Cauchy-Schwarz  inequality  on  the  vector  of 
eigenvalues  of  |G  — G^|.  Using  the  Jordan  decomposition  of  a  Hermitian  matrix  into  a 
difference  of  positive  matrices  X  =  [X]+  —  [X]_,  we  can  rewrite  G  —  G"^  as 


G  -  G-^  =  G  -  [G  +  F]+  =  -F  -  [G  +  F]_. 


(30) 


Then  by  the  triangle  inequality  and  positivity  of  G, 

||G-G"||tr^  ||F||.,+  ||[G  +  F]_||tr^2||F||,,.  (31) 

Using  the  standard  estimate  ||F||tr  ^  v^||F||f,  we  find  that 

A  ^  (32) 

This  gives  our  desired  error  bound,  albeit  in  terms  of  Cq  instead  of  the  final  quantity  e.  To  get 
our  total  error  to  vanish,  we  take  e  =  which  gives  us  the  final  scaling  in  the  sample 

complexity.  □ 


As  a  final  remark,  we  note  that  by  computing  A  for  the  special  case  p  =  p  =  'l/r,  E  =  e^l/r, 
we  find  that 

A=yr+7^-i,  (33) 
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so  the  error  bound  for  this  protocol  is  tight  with  respect  to  the  scaling  in  Cq  (and  hence  e). 
However,  we  cannot  rule  out  that  there  are  other  protocols  that  achieve  a  better  scaling.  Also, 
it  seems  that  the  upper  bound  for  the  current  protocol  with  respect  to  r  could  potentially  be 
improved  by  a  factor  of  r  with  a  more  careful  analysis. 

5.  Numerical  simulations 

We  numerically  simulated  the  reconstruction  of  a  quantum  state  given  Pauli  measurements  and 
the  estimators  from  equations  (3)  and  (4).  Here  we  compare  these  estimates  to  those  obtained 
from  a  standard  maximum  likelihood  estimate  (MLE)  as  a  benchmark. 

As  mentioned  previously,  all  of  our  estimators  have  the  advantage  that  they  are  convex 
functions  with  convex  constraints,  so  the  powerful  methods  of  convex  programming  [72] 
guarantee  that  the  exact  solution  can  be  found  efficiently  and  certified.  We  take  full  advantage 
of  this  certifiability  and  do  simulations  which  can  be  certified  by  interior-point  algorithms.  This 
way  we  can  separate  out  the  performance  of  the  estimators  themselves  from  the  (sometimes 
heuristic)  methods  used  to  compute  them  on  very  large  scale  instances. 

For  our  estimators,  we  will  explicitly  enforce  the  positivity  condition.  That  is,  we  will 
simulate  the  following  slight  modifications  of  equations  (3)  and  (4)  given  by 

Pds  =  arg  min  TrA  s.t.  ||  A*(A(A)  -  y)  |K  A  (34) 

and 

Puasso  =  arg  min  ^  ||  A(A)  -y\\l  +  /xTrX.  (35) 

Moreover,  whenever  the  trace  of  the  resulting  estimate  is  less  than  1 ,  we  renormalize  the  state, 
p  ^p/Tr(p). 

5.1.  Setting  the  estimator  parameters  A  and  /x 

From  theorem  1  we  know  roughly  how  we  should  choose  the  free  parameters  A  and  /x,  modulo 
the  unknown  constants  C,  ,  for  which  we  will  have  to  make  an  empirical  guess.  We  guess  that 
the  log  factor  is  an  artifact  of  the  analysis  and  that  the  state  is  nearly  pure.  Then  we  could 
choose  A  ~  /X  ~  dj-sft.  For  the  matrix  Dantzig  selector  of  equation  (34),  we  specifically  chose 
A  =  3dl-s/i  for  our  simulations.  However,  this  did  not  lead  to  very  good  performance  for  the 
matrix  Fasso  of  equation  (35)  when  m  was  large.  We  chose  instead  /x  =  Amj .^fi,  which  agrees 
with  A  when  m  d,  as  it  can  be  for  nearly  pure  states. 

We  stress  that  very  little  effort  was  made  to  optimize  these  choices  for  the  parameters.  We 
simply  picked  a  few  integer  values  for  the  constants  in  front  (namely  2,  . . . ,  5),  and  chose  the 
constant  that  worked  best  upon  a  casual  inspection  of  a  few  data  points.  We  leave  open  the 
problem  of  determining  what  the  optimal  choices  are  for  A  and  /x. 

5.2.  Time  needed  to  switch  measurement  settings 

Real  measurements  take  time.  In  a  laboratory  setting,  time  is  a  precious  commodity  for  a 
possibly  non-obvious  reason.  An  underlying  assumption  in  the  way  we  typically  formulate 
tomography  is  that  the  unknown  state  p  can  be  prepared  identically  many  times.  However,  the 
parameters  governing  the  experiment  often  drift  over  a  certain  timescale,  and  this  means  that 
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beyond  this  characteristic  time,  it  is  no  longer  appropriate  to  describe  the  measured  ensemble 
by  the  symmetric  product  state  p®'. 

To  account  for  this  characteristic  timescale,  we  introduce  the  following  simple  model.  We 
assume  that  the  entire  experiment  takes  place  in  a  fixed  amount  of  time  T.  In  a  real  experiment, 
this  is  the  timescale  beyond  which  we  cannot  confidently  prepare  the  same  state  p,  or  it  is  the 
amount  of  time  it  takes  for  the  student  in  the  laboratory  to  graduate,  or  for  the  grant  funding  to 
run  out,  whichever  comes  first.  Within  this  time  limit,  we  can  either  change  the  measurement 
settings,  or  we  can  take  more  samples.  We  assume  that  there  is  a  unit  cost  to  taking  a  single 
measurement,  but  that  switching  to  a  different  measurement  configuration  takes  an  amount  of 
time  c. 

At  the  one  extreme,  the  switching  cost  c  to  change  measurement  settings  could  be  quite 
large,  so  that  it  is  barely  possible  to  make  enough  independent  measurements  that  tomography 
is  possible  within  the  allotted  time  T .  In  this  case,  we  expect  that  compressed  tomography 
will  outperform  standard  methods,  since  these  methods  have  not  been  optimized  for  the  case  of 
incomplete  data.  At  the  other  extreme,  the  switching  cost  c  could  be  set  to  0  (or  some  other  small 
value),  in  which  case  we  are  only  accounting  for  the  cost  of  sampling.  Here  it  is  not  clear  which 
method  is  better,  for  the  following  reason.  Although  the  standard  methods  are  able  to  take  a 
sufficient  number  of  data  in  this  case,  each  observable  will  attain  comparatively  little  precision 
relative  to  the  fewer  observables  measured  with  higher  precision  in  the  case  of  compressed 
tomography  for  the  same  fixed  amount  of  time  T .  One  of  our  goals  is  to  investigate  if  there  are 
any  trade-offs  in  this  simple  model. 

For  the  relative  cost  c  between  switching  measurement  settings  and  taking  one  sample,  we 
use  c  =  20,  a  number  inspired  by  the  current  ion  trap  experiments  done  by  the  Blatt  group  in 
Innsbruck.  However,  we  note  that  the  conclusions  do  not  seem  to  be  very  sensitive  to  this  choice, 
as  long  as  we  do  not  do  something  outrageous  like  c  >  t,  which  would  preclude  measuring  more 
than  even  one  observable. 

5.3.  Other  simulation  parameters 

We  consider  the  following  ensembles  of  quantum  states.  We  initially  choose  n  =  5  qubit  states 
from  the  invariant  Haar  measure  on  C^" .  Then  we  add  noise  to  the  state  by  applying  independent 
and  identical  depolarizing  channels  to  each  of  the  n  qubits.  Recall  that  the  depolarizing  channel 
with  strength  y  acts  on  a  single  qubit  as 

2?y(p)  =  y^  +  (i-y)p;  (36) 

that  is,  with  probability  y  the  qubit  is  replaced  by  a  maximally  mixed  state  and  otherwise  it  is 
left  alone.  Our  simulations  assume  very  weak  decoherence,  and  we  use  y  =  0.01. 

We  use  two  measures  to  quantify  the  quality  of  a  state  reconstruction  p  relative  to 
the  underlying  true  state  p,  namely  the  (squared)  fidelity  •^he  trace  distance 

\\\p-p\W- 

We  used  the  interior-point  solver  SeDuMi  [26]  to  do  the  optimizations  in  equations  (34) 
and  (35).  Although  these  algorithms  are  not  suitable  for  large  numbers  of  qubits,  they  produce 
very  accurate  solutions,  which  allows  for  a  more  reliable  comparison  between  the  different 
estimators.  The  data  and  the  code  used  to  generate  the  data  for  these  plots  can  be  found  with  the 
source  code  for  the  arXiv  version  of  this  paper.  For  larger  numbers  of  qubits,  one  may  instead 
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use  specialized  algorithms  such  as  singular  value  thresholding  (SVT)  or  fixed-point  continuation 
(FPCA)  to  solve  these  convex  programs  [27-29] ;  however,  the  quality  of  the  solutions  depends 
somewhat  on  the  algorithm,  and  it  is  an  open  problem  to  explore  this  in  more  detail. 

For  the  MLE,  we  used  the  iterative  algorithm  of  [73],  which  has  previously  been  used  in 
e.g.  [74].  This  method  converged  on  every  example  we  tried,  so  we  did  not  have  to  use  the  more 
sophisticated  ‘diluted’  method  of  [75]  that  guarantees  convergence. 

For  the  purposes  of  our  simulations,  we  sampled  our  Pauli  operators  without  replacement. 

5.4.  Results  and  analysis 

The  data  are  depicted  in  figure  1.  The  three  subfigures  (a)-(c)  use  three  different  values  for 
the  total  sampling  time  T  in  increasing  order.  We  have  plotted  both  the  fidelity  and  the  trace 
distance  (inset)  between  the  recovered  solution  and  the  true  state. 

Several  features  are  immediately  apparent.  First,  we  see  that  both  of  the  compressed 
sensing  estimators  consistently  outperform  MLE  along  a  wide  range  of  values  of  m,  the  number 
of  Paulis  sampled,  even  when  we  choose  the  optimal  value  of  m  just  for  the  MLE.  Once  m 
is  sufficiently  large  (but  still  m  d^)  all  of  the  estimators  converge  to  a  reasonably  high- 
fidelity  estimate.  Thus,  it  is  not  just  the  compressed  sensing  estimators  which  are  capable  of 
reconstructing  nearly  low-rank  states  with  few  measurement  settings,  but  rather  this  seems  to  be 
a  generic  feature  of  many  estimators.  However,  the  compressed  sensing  estimators  seem  to  be 
particularly  well  suited  for  the  task. 

When  the  total  amount  of  time  T  is  very  small  (figure  1(a)),  then  there  is  a  large  advantage 
of  using  compressed  sensing.  In  this  regime,  if  we  insist  on  making  an  informationally 
complete  measurement,  there  is  barely  enough  time  to  make  one  measurement  per  setting, 
so  the  measurement  results  are  dominated  by  statistical  fluctuations,  i.e.  the  signal-to-noise 
ratio  is  poor.  Compressed  sensing  techniques — measuring  an  incomplete  set  of  observables, 
and  regularizing  the  solution  to  favor  low-rank  states — lead  to  much  better  results.  However, 
note  that  in  this  situation,  it  is  very  difficult  for  any  method,  including  compressed  sensing,  to 
achieve  extremely  high-fidelity  state  reconstructions  (with  fidelity  greater  than  95%,  say). 

As  the  total  amount  of  time  available  increases,  all  of  the  estimators  of  course  converge 
to  higher  fidelity  estimates.  Interestingly,  for  the  compressed  sensing  estimators,  there  seems  to 
be  no  trade-off  whatsoever  between  the  number  of  measurement  settings  and  the  fidelity,  once 
T  and  m  are  sufficiently  large.  That  is,  the  curve  is  completely  flat  as  a  function  of  m  above  a 
certain  cutoff.  This  is  most  notable  for  the  matrix  Lasso,  which  consistently  performs  at  least  as 
well  as  the  matrix  Dantzig  selector,  and  often  better.  These  observations  are  consistent  with  the 
bounds  proven  earlier. 

The  flat  curve  as  a  function  of  m  is  especially  interesting,  because  it  suggests  that  there 
is  no  real  drawback  to  using  small  values  of  m.  Smaller  values  of  m  are  attractive  from  the 
computational  point  of  view  because  the  runtime  of  each  reconstruction  algorithm  scales  with 
m.  This  makes  a  strong  case  for  adopting  these  estimators  in  the  future,  and  at  the  very  least  more 
numerical  studies  are  needed  to  investigate  how  well  these  estimators  perform  more  generally. 

We  draw  the  following  conclusion  from  these  simulations.  The  best  performance  for  a 
fixed  value  of  T  is  given  by  the  matrix  Lasso  estimator  of  equations  (4)  and  (35)  in  the  regime 
where  m  is  nearly  as  small  as  possible.  Here  the  fidelity  is  larger  than  the  other  estimators  (if 
only  by  a  small  or  negligible  amount  when  T  is  large),  but  using  a  small  value  for  m  means 
smaller  memory  and  processing  time  requirements  when  doing  the  state  reconstruction.  MLE 
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Number  of  Sampled  Paulis  (m) 
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Fidelity  and  Trace  Distance  T=270000,  c=20 


Figure  1.  Fidelity  and  trace  distance  (inset)  as  a  function  of  m,  the  number 
of  sampled  Paulis.  Plots  (a),  (b)  and  (c)  are  for  a  total  measurement  time  of 
T  =  41k,  80k  and  270k,  respectively,  in  units  where  the  cost  to  measure  a  single 
copy  is  one  unit  of  time,  and  the  cost  to  switch  measurement  settings  is  c  =  20 
units.  The  three  estimators  plotted  are  the  matrix  Dantzig  selector  (equation  (34), 
blue),  the  matrix  Lasso  (equation  (35),  red)  and  a  standard  MLE  (green).  Each 
bar  shows  the  mean  and  standard  deviation  from  120  Haar-random  five-qubit 
states  with  1%  local  depolarizing  noise.  Our  estimators  consistently  outperform 
MEE,  even  after  optimizing  the  MEE  over  m.  (We  do  not  know  what  precisely 
causes  the  degradation  of  the  performance  of  MEE  in  figure  1  a,  but  it  is  plausible 
to  speculate  that  the  cost  term  c  is  responsible  for  this  effect).  See  the  main  text 
for  more  details. 


consistently  underperforms  the  compressed  sensing  estimators,  but  still  seems  to  ‘see’  the  low- 
rank  nature  of  the  underlying  state  and  converges  to  a  reasonable  estimate  even  when  m  is  small. 

6.  Process  tomography 

Compressed  sensing  techniques  can  also  be  applied  to  quantum  process  tomography.  Here,  our 
method  has  an  advantage  when  the  unknown  quantum  process  has  a  small  Kraus  rank,  meaning 
that  it  can  be  expressed  using  only  a  few  Kraus  operators.  This  occurs,  for  example,  when  the 
unknown  process  consists  of  unitary  evolution  (acting  globally  on  the  entire  system)  combined 
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Figure  2.  Compressed  quantum  process  tomography  using  an  ancilla.  The 
quantum  circuit  represents  a  single  measurement  setting,  where  one  measures 
the  observable  on  the  system  and  Pg  on  the  ancilla. 


with  local  noise  (acting  on  each  qubit  individually,  or  more  generally,  acting  on  small  subsets 
of  the  qubits). 

Consider  a  system  of  n  qubits,  with  Hilbert  space  dimension  d  =  2" .  Let  5  be  a  completely 
positive  trace-preserving  (CP)  map  from  to  and  suppose  that  E  has  Kraus  rank 

r.  Using  compressed  sensing,  we  will  show  how  to  characterize  E  using  m=  0{rd\ogd) 
settings.  (For  comparison,  standard  process  tomography  requires  d^  settings,  since  E  contains 
d^  independent  parameters  in  general.)  Furthermore,  our  compressed  sensing  method  requires 
only  the  ability  to  prepare  product  eigenstates  of  Pauli  operators  and  make  Pauli  measurements, 
and  it  does  not  require  any  ancilla  qubits. 

We  remark  that,  except  for  the  ancilla-assisted  method,  just  the  notion  of  ‘measurement 
settings’  for  process  tomography  does  not  capture  all  of  the  complexity  because  of  the  need  to 
have  an  input  to  the  channel.  Here  we  define  one  ‘input  setting’  to  be  a  basis  of  states  from 
which  the  channel  input  should  be  sampled  uniformly.  Then  the  total  number  of  settings  m  is 
the  sum  of  the  number  of  measurement  settings  (Paulis,  in  our  case)  and  input  settings.  This 
definition  justifies  the  claim  that  the  number  of  settings  for  our  compressed  process  tomography 
scheme  is  m  =  0(rd^  log  d). 

The  analysis  here  focuses  entirely  on  the  number  of  settings  m.  We  forgo  a  detailed  analysis 
of  t,  the  sample  complexity,  and  instead  leave  this  open  for  future  work. 

Note  that  there  is  a  related  set  of  techniques  for  estimating  an  unknown  process  that  is 
element-wise  sparse  with  respect  to  some  known,  experimentally  accessible  basis  [19].  These 
techniques  are  not  directly  comparable  to  ours,  since  they  assume  a  greater  amount  of  prior 
knowledge  about  the  process,  and  they  use  measurements  that  depend  on  this  prior  knowledge. 
We  will  discuss  this  in  more  detail  at  the  conclusion  of  this  section. 

6.1.  Our  method 

First,  recall  the  Jamiolkowski  isomorphism  [76]:  the  process  E  is  completely  and  uniquely 
characterized  by  the  state 

P£  =  iE®miro}{M, 

where  |i/fo)  =  Yl‘j=i  li)  ®  li)-  Note  that  when  E  has  Kraus  rank  r,  the  state  pg  has  rank  r. 
This  immediately  gives  us  a  way  to  do  compressed  quantum  process  tomography:  first  prepare 
the  Jamiolkowski  state  p£  (by  adjoining  an  ancilla,  preparing  the  maximally  entangled  state 
|T/^o)^  and  applying  E)',  then  do  compressed  quantum  state  tomography  on  pf ;  see  figure  2. 

We  now  show  a  more  direct  implementation  of  compressed  quantum  process  tomography 
that  is  equivalent  to  the  above  procedure,  but  does  not  require  an  ancilla.  Observe  that  in  the 
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Figure  3.  Compressed  quantum  process  tomography,  implemented  directly 
without  an  ancilla.  Here  one  prepares  a  random  eigenstate  of  Pg,  applies  the 
process  S  and  measures  the  observable  Pa  on  the  output. 


above  procedure,  we  need  to  estimate  expectation  values  of  the  form 
Tr((P^  <8)  Pb)Ps)  =  Tr((PA  <8)  PB)i£ 

where  Pa  and  Pb  are  Pauli  matrices.  By  using  the  Kraus  decomposition,  it  is  straightforward  to 
derive  the  equivalent  expression 

Tr((PA  (8)  PB)Pe)  =  ^Tt(Pa£(P^)),  (37) 

a 

where  the  bar  denotes  complex  conjugation  in  the  standard  basis. 

We  now  show  how  to  estimate  the  expectation  value  (37).  Let  Xj  and  \^j)  denote  the 
eigenvalues  and  eigenvectors  of  Pg-  Then  we  have 

1 

Tr((PA  (8)  Pb)Ps)  =  -  ^  A,Tr(PAP(|</),)((/.,|)). 

^  7  =  1 

To  estimate  this  quantity,  we  repeat  the  following  experiment  many  times,  and  average  the 
results:  choose  j  G  [r/]  uniformly  at  random,  prepare  the  state  \^j),  apply  the  process  £,  measure 
the  observable  Pa,  and  multiply  the  measurement  result  by  Xj.  (See  figure  3.)  In  this  way,  we 
learn  the  expectation  values  of  the  Jamiolkowski  state  p£  without  using  an  ancilla.  We  then  use 
compressed  quantum  state  tomography  to  learn  pg,  and  from  this  we  recover  £.  Note  that  since 
the  approach  described  here  can  also  be  applied  to  the  case  that  E  is  not  trace-preserving,  it 
applies  to  detector  tomography  as  well. 

6.2.  Related  work 

Our  method  is  somewhat  different  from  the  method  described  in  [19].  Essentially,  the  difference 
is  that  our  method  works  for  any  quantum  process  with  a  small  Kraus  rank,  whereas  the  method 
of  Shabani  et  al  works  for  a  quantum  process  that  is  element-wise  sparse  in  a  known  basis 
(provided  this  basis  is  experimentally  accessible  in  a  certain  sense).  The  main  advantage  of  the 
Shabani  et  al  method  is  that  it  can  be  much  faster:  for  a  quantum  process  E  that  is  ^'-sparse  (i.e. 
has  s  non-zero  matrix  elements),  it  requires  only  0{s  log  J)  settings.  The  main  disadvantage  is 
that  it  requires  more  prior  knowledge  about  E  and  is  more  difficult  to  apply.  While  it  has  been 
demonstrated  in  a  number  of  scenarios,  there  does  not  seem  to  be  a  general  recipe  for  designing 
measurements  that  are  both  experimentally  feasible  and  effective  in  the  sparse  basis  of  E. 

To  clarify  these  issues,  we  now  briefly  review  the  Shabani  et  al  method.  We  assume  that 
we  know  a  basis  T  =  {Talo!  e  \d^Vi  in  which  the  process  E  is  ^-sparse.  For  example,  when  E  is 
close  to  some  unitary  evolution  U ,  one  can  construct  T  using  the  singular- value  decomposition 
(SVD)  basis  of  U .  This  guarantees  that,  if  E  contains  no  noise,  it  will  be  perfectly  sparse  in 
the  basis  T.  However,  in  practice,  E  will  contain  noise,  which  need  not  be  sparse  in  the  basis 
T;  any  non-sparse  components  will  not  in  general  be  estimated  accurately.  The  success  of  the 
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Shabani  et  al  method  therefore  rests  on  the  assumption  that  the  noise  is  also  sparse  in  the  basis 
r.  Although  this  assumption  has  been  verified  in  a  few  specific  scenarios,  it  seems  less  clear 
why  it  should  hold  in  general.  By  comparison,  our  method  simply  assumes  that  the  noise  is 
described  by  a  process  matrix  that  is  low  rank;  this  can  be  rigorously  justified  for  any  noise 
process  that  involves  only  local  interactions  or  few-body  processes. 

The  other  complication  with  the  Shabani  et  al  method  concerns  the  design  of  the  state 
preparations  and  measurements.  On  the  one  hand,  these  must  satisfy  the  RIP  condition  for 
5 -sparse  matrices  over  the  basis  T;  on  the  other  hand,  they  must  be  easy  to  implement 
experimentally.  This  has  been  demonstrated  in  some  cases,  by  using  states  and  measurements 
that  are  ‘random’  enough  to  show  concentration  of  measure  behaviors,  but  also  have  tensor 
product  structure.  However,  these  constructions  are  not  guaranteed  to  work  in  general  for  an 
arbitrary  basis  T. 

We  leave  open  the  problem  of  doing  a  comparative  study  between  these  and  other 
methods  [77,  78],  akin  to  [79]. 

7.  Conclusion 

In  this  work,  we  have  introduced  two  new  estimators  for  compressed  tomography:  the  matrix 
Dantzig  selector  and  the  matrix  Lasso  (equations  (3)  and  (4)).  We  have  proved  that  the  sample 
complexity  for  obtaining  an  estimate  that  is  accurate  to  within  e  in  the  trace  distance  scales  as 
O  log  d)  for  rank-r  states  and  that  for  higher  rank  states,  the  additional  error  is  proportional 
to  the  truncation  error.  This  error  scaling  is  optimal  up  to  constant  multiplicative  factors,  and 
requires  measuring  only  O  (rd  poly  log  r/)  Pauli  expectation  values,  a  fact  we  proved  using 
the  RIP  [30].  We  also  proved  that  our  sample  complexity  upper  bound  is  within  poly  log  d 
of  the  sample  complexity  of  the  optimal  minimax  estimator,  where  the  risk  function  is  given 
by  a  trace  norm  confidence  interval.  We  showed  how  a  modification  of  DFE  can  be  used  to 
unconditionally  certify  the  estimate  using  a  number  of  measurements  which  is  asymptotically 
negligible  compared  to  obtaining  the  original  estimate.  We  numerically  simulated  our  estimators 
and  found  that  they  outperform  MLE,  giving  higher  fidelity  estimates  from  the  same  amount 
of  data.  Einally,  we  generalized  our  method  to  quantum  process  tomography  using  only  Pauli 
measurements  and  preparation  of  product  eigenstates  of  Pauli  operators. 

There  are  many  interesting  directions  for  future  work.  One  major  open  problem  is  to  switch 
the  focus  from  two-outcome  Pauli  measurements  to  alternative  measurements  which  are  still 
experimentally  feasible.  Eor  example,  measurements  in  a  local  basis  have  2"  outcomes  and  are 
not  directly  analyzable  using  our  techniques.  It  would  be  very  interesting  to  give  an  analysis  of 
our  estimators  from  the  perspective  of  such  local  basis  measurements.  One  difficulty,  however, 
is  that  something  like  the  RIP  is  not  likely  to  hold  in  this  case,  so  we  will  need  additional 
techniques. 

Another  direction  is  to  find  better  methods  for  deriving  error  bars  or  confidence  regions 
for  compressed  tomography.  In  principle,  one  can  use  the  error  bounds  in  this  paper  to  estimate 
confidence  regions  (based  on  mild  assumptions  about  the  experimental  setup),  and  then  use 
DEE  to  check  (unconditionally,  i.e.  without  making  any  assumptions)  whether  the  true  state  lies 
within  that  confidence  region.  However,  while  the  error  bounds  in  this  paper  have  roughly  the 
right  asymptotic  scaling  (with  the  rank  r  and  dimension  d),  in  practice  one  may  want  to  know 
finer  properties  of  the  confidence  region,  such  as  its  shape  [63,  64] . 
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A  related  direction  is  to  find  methods  for  rank-based  model  selection  (or,  more  simply, 
estimating  the  rank  of  the  unknown  state)  [80].  In  principle,  one  can  always  ‘guess  and  check’ 
the  rank,  using  DFE;  but  a  more  sophisticated  approach  may  perform  better  in  practice. 

Some  other  open  problems  are  to  tighten  the  gap  that  remains  between  the  sample 
complexity  upper  and  lower  bounds,  and  to  try  to  prove  optimality  with  respect  to  alternative 
criteria,  in  addition  to  minimax  risk.  For  example,  it  would  be  interesting  to  find  a  useful  notion 
of  average  case  optimality. 

On  the  numerical  side,  some  of  the  open  questions  are  the  following.  First,  it  would  be  very 
interesting  to  carry  out  a  more  detailed  numerical  study  of  the  performance  of  our  estimators. 
While  they  have  clearly  outperformed  MFE  in  the  simulations  reported  here,  there  is  no  question 
that  this  is  a  narrow  range  of  parameters  on  which  we  have  tested  these  estimators.  It  would  be 
interesting  to  do  additional  comparative  studies  between  these  and  other  estimators  to  see  how 
robust  these  performance  enhancements  are.  It  would  also  be  very  interesting  to  study  fast  first- 
order  solvers  such  as  [27-29]  which  work  particularly  well  for  approximately  low -rank  matrices 
and  which  could  compute  estimates  on  a  large  number  of  qubits  (ten  or  more). 

The  success  of  the  estimators  we  studied  depends  on  being  able  to  find  good  values  for 
the  parameters  X  and  /x.  While  we  have  used  simple  heuristics  to  pick  particular  values,  a 
detailed  study  of  the  optimal  values  for  these  parameters  could  only  improve  the  quality  of  our 
estimators.  Moreover,  MFE  seems  to  enjoy  the  same  ‘plateau’  phenomenon,  where  the  quality 
of  the  estimate  is  insensitive  to  m  above  a  certain  cutoff.  This  leads  us  to  speculate  that  this  is  a 
generic  phenomenon  among  many  estimators,  and  that  perhaps  there  are  even  better  choices  for 
estimators  than  the  ones  we  benchmark  here. 
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